{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# CPSC 330 hw6"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "from sklearn.model_selection import train_test_split, cross_validate, cross_val_score\n",
    "\n",
    "# Add your imports below\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def cross_validate_std(*args, **kwargs):\n",
    "    \"\"\"Like cross_validate, except also gives the standard deviation of the score\"\"\"\n",
    "    res = pd.DataFrame(cross_validate(*args, **kwargs))\n",
    "    res_mean = res.mean()\n",
    "\n",
    "    res_mean[\"std_test_score\"] = res[\"test_score\"].std()\n",
    "    if \"train_score\" in res:\n",
    "        res_mean[\"std_train_score\"] = res[\"train_score\"].std()\n",
    "    return res_mean"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Instructions\n",
    "rubric={points:5}\n",
    "\n",
    "Follow the [homework submission instructions](https://github.com/UBC-CS/cpsc330/blob/master/docs/homework_instructions.md). "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The dataset\n",
    "\n",
    "In this assignment we'll look at the [SMS Spam Collection Dataset](https://www.kaggle.com/uciml/sms-spam-collection-dataset). The task is to predict whether a text message (SMS) is spam or not spam (\"ham\"). **Sorry for the offensive language in some text messages. If you are sensitive to such language you may wish to avoid reading the raw messages. I have attempted to design the assignment so that any messages you need to read are not disturbing ones.**\n",
    "\n",
    "You should start by downloading the dataset and extracting the csv to your current directory. As usual, please do not commit it to your repos."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sms_df = pd.read_csv(\"spam.csv\", encoding=\"latin-1\")\n",
    "sms_df = sms_df.drop(columns=[\"Unnamed: 2\", \"Unnamed: 3\", \"Unnamed: 4\"])\n",
    "sms_df = sms_df.rename(columns={\"v1\": \"target\", \"v2\": \"sms\"})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_train, df_test = train_test_split(sms_df, random_state=123)\n",
    "df_train.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_train.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this first part of the assignment, we'll build a classification model to predict whether a message is spam or ham."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train = df_train[\"sms\"]\n",
    "y_train = df_train[\"target\"]\n",
    "\n",
    "X_test = df_test[\"sms\"]\n",
    "y_test = df_test[\"target\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 1\n",
    "rubric={points:25}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Use `CountVectorizer` to create features from the text data.\n",
    "- Choose an appropriate baseline model (`DummyClassifier` or `DummyRegressor` to predict spam vs. ham and report the relevant scores\n",
    "- Choose an appropriate linear model (`LogisticRegression` or `Ridge`) to predict spam vs. ham\n",
    "- Choose an appropriate random forest model (`RandomForestClassifier` or `RandomForestRegressor`) to predict spam vs. ham\n",
    "- Report the relevant scores for your two models above. You can keep default hyperparameters for simplicity.\n",
    "- Report the most important features according to your linear model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 2\n",
    "\n",
    "Let's now try to use pre-trained word embeddings."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we saw in class, using pre-trained word embeddings is very common in NLP. These embeddings are created by training a model like [Word2vec](https://en.wikipedia.org/wiki/Word2vec) on a huge corpus of text. In this exercise we will use a package called [spaCy](https://spacy.io/). Unfortunately, I didn't anticipate using spaCy at the start of the course, and thus it was not included in your course environment. You will need to install it now. Perform the following steps:\n",
    "\n",
    "1. Open a terminal and activate your cpsc330env environment\n",
    "2. Run `conda install spacy` if you're using conda and `cpsc330env.yml`, or `pip install spacy` if you're using pip and `requirements.txt`\n",
    "3. Run `python -m spacy download en_core_web_md`\n",
    "\n",
    "The last line downloads the trained language model itself, called [`en_core_web_md`](https://spacy.io/models/en#en_core_web_md). It is about 50 MB.\n",
    "\n",
    "When you are done, the following line of could should run:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import spacy\n",
    "nlp = spacy.load(\"en_core_web_md\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If there are issues, please ask for help on Piazza."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2(a)\n",
    "rubric={points:5}\n",
    "\n",
    "Our pre-trained `en_core_web_md` model gets us a vector representation of text:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train.iloc[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nlp(X_train.iloc[0]).vector"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is analogous to calling `transform` with `CountVectorizer`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Compare the _length of the representation_ for these embeddings vs. the `CountVectorizer` approach. Then, compare the _number of nonzero entries_ for the two repesentations of the first training example. Briefly discuss.\n",
    "\n",
    "Note: As briefly discussed in Lecture 14, a common error here is that scikit-learn methods expect certain data shapes as their input. To address this you can use `X_train.iloc[[0]]` instead of `X_train.iloc[0]`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2(b)\n",
    "rubric={points:5}\n",
    "\n",
    "In Exercise 1 you used `CountVectorizer` to generate features, which were then fed into a model. We can do the same here with the features from the pre-trained embedding model. \n",
    "\n",
    "In this case, for computational reasons I will first get the embeddings for the entire train and test sets (note that this doesn't violate the Golden Rule because the transformation is independent for each example):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train_embeddings = pd.DataFrame([sms.vector for sms in nlp.pipe(X_train)])\n",
    "X_test_embeddings  = pd.DataFrame([sms.vector for sms in nlp.pipe(X_test)])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train_embeddings.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What sort of scores can you get with these features instead? Compare with your scores from Exercise 1 and briefly discuss. Again, it's fine to stick to default hyperparameters to save time."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2(c)\n",
    "rubric={points:1}\n",
    "\n",
    "Score your models on the test data. Are the results what you expected?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 3\n",
    "\n",
    "Now we're done with trying to predict class (spam vs. ham). Our next task will be trying to find similar messages to a query message using nearest neighbours, like the product recommendations we discussed in Lecture 14."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3(a) \n",
    "rubric={points:5}\n",
    "\n",
    "Using scikit-learn's `NearestNeighbours` on the word count features from `CountVectorizer`, searching the training data to find the 5 most similar messages to this made-up message:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "query_sms = \"Hey how about some CPSC 330 studying over Zoom or socially distanced at a park? This course is so much fun right?\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use Euclidean distance and the same `CountVectorizer` you used in Exercise 1.\n",
    "\n",
    "Note: The `kneighbors` function returns indices of the neighbours. To retrieve the corresponding messages, I recommend indexing using the `iloc` syntax.\n",
    "\n",
    "Note: We don't exactly have a notion of train and test anymore, because we're not doing supervised learning anymore. I just picked the training set for simplicity."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3(b)\n",
    "rubric={points:2}\n",
    "\n",
    "Repeat part (a) but using cosine similarity instead of Euclidean distance. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3(c)\n",
    "rubric={points:5}\n",
    "\n",
    "In lecture we talked about how Euclidean distance resulted in less popular items being recommended than with cosine similarity. What is the analog of \"popularity\" here? Are your results from parts (a) and (b) consistent with this notion?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3(d)\n",
    "rubric={points:3}\n",
    "\n",
    "Repeat parts (a) and (b) but this time with the pre-trained embeddings from Exercise 2."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3(e)\n",
    "rubric={points:2}\n",
    "\n",
    "Our first approach, using `CountVectorizer` features, should only retrieve similar messages if they have some words in common with the query message. Is this also true for the pre-trained embedding approach as well? Briefly discuss."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3(f)\n",
    "rubric={points:2}\n",
    "\n",
    "In class we talked about how, when using pre-trained models, it's important that the original training data was somewhat similar to our own data. For example, in Lecture 13 we talked about how the dog breed images were fairly similar to ImageNet images. We're using the `en_core_web_md` pre-trained model from spaCy - its documentation is [here](https://spacy.io/models/en#en_core_web_md). Based on the documentation, it seems like the word vectors come from [Common Crawl](https://commoncrawl.org/). Do you think that training data is suitable for turning these SMS messages into feature vectors? Briefly discuss. (There is no single correct answer here!)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 4: Very short answer questions\n",
    "rubric={points:8}\n",
    "\n",
    "Each question is worth 2 points.\n",
    "\n",
    "The first two questions pertain to the material we skipped in Lecture 12. A screencast is available on the course README; [here](https://www.dropbox.com/s/da7lx8kdzxfmna2/lecture12.mp4?dl=0) is the link for convenience."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. Consider using `CountVectorizer(max_features=1000)` and then reducing to 500 features with `RFE(n_features_to_select=500)` vs. using `CountVectorizer(max_features=500)`. Are these two approaches the same? If not, how are they different?\n",
    "2. After running feature selection with `RFE`, `rfe.ranking_` tells you the order in which the features were removed. Why could this order be different from the order of the original feature importances, ranked from least to most important?\n",
    "3. In Lecture 13 we discussed how neural networks are sort of like `Pipeline`s, in the sense that they involve multiple sequential transformations of the data, finally resulting in the prediction. Why was this property useful when it came to transfer learning?\n",
    "4. In Lecture 15 we saw our pre-trained word embedding model output an analogy that reinforced a gender stereotype. Give an example of how using such a model could cause harm in the real world."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Submission to Canvas\n",
    "\n",
    "**IF YOU ARE WORKING WITH A PARTNER** please form the group before submitting - see instructions [here](https://github.com/UBC-CS/cpsc330/blob/master/docs/homework_instructions.md#partners).\n",
    "\n",
    "When you are ready to submit your assignment do the following:\n",
    "\n",
    "1. Run all cells in your notebook to make sure there are no errors by doing `Kernel -> Restart Kernel and Clear All Outputs` and then `Run -> Run All Cells`.\n",
    "2. Save your notebook.\n",
    "3. Convert your notebook to `.html` format using the `convert_notebook()` function below **or** by `File -> Export Notebook As... -> Export Notebook to HTML`.\n",
    "4. Run the code `submit()` below to go through an interactive submission process to Canvas.\n",
    ">For this step, you will need a Canvas *Access Token* token. If you haven't already got one, log-in to Canvas, click `Account` (top-left of the screen), then `Settings`, then scroll down until you see the `+ New Access Token` button. Click that button, give your token any name you like and set the expiry date to Dec 31, 2020. Then click `Generate token`. Save this token in a safe place on your computer as you'll need it for all assignments. Treat the token with as much care as you would an important password. \n",
    "\n",
    "Note: for those having trouble with the Jupyter widgets and the dropdowns: if you add the argument `no_widgets=True` to your `submit` call, it should let you do a text-based entry of your key and avoid the dropdowns altogether. If this doesn't work, you probably need to upgrade to the latest version of `canvasutils` with `pip install canvasutils -U` from your terminal with your environment activated.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from canvasutils.submit import submit, convert_notebook\n",
    "\n",
    "# Note: the canvasutils package should have been installed as part of your environment setup - \n",
    "# see https://github.com/UBC-CS/cpsc330/blob/master/docs/setup.md"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# convert_notebook(\"hw6.ipynb\", \"html\")  # uncomment and run when you want to try convert your notebook to HTML (or you can convert manually from the File menu)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# submit(course_code=53561, token=False)  # uncomment and run when ready to submit "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  },
  "name": "_merged",
  "toc": {
   "colors": {
    "hover_highlight": "#DAA520",
    "navigate_num": "#000000",
    "navigate_text": "#333333",
    "running_highlight": "#FF0000",
    "selected_highlight": "#FFD700",
    "sidebar_border": "#EEEEEE",
    "wrapper_background": "#FFFFFF"
   },
   "moveMenuLeft": true,
   "nav_menu": {
    "height": "438px",
    "width": "252px"
   },
   "navigate_menu": true,
   "number_sections": true,
   "sideBar": true,
   "threshold": 4,
   "toc_cell": false,
   "toc_section_display": "block",
   "toc_window_display": false,
   "widenNotebook": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
